ReneWind

Renewable energy sources play an increasingly important role in the global energy mix as efforts to reduce the environmental impact of energy production intensify.

Among renewable energy alternatives, wind energy is one of the most developed technologies worldwide. The U.S. Department of Energy has put together a guide to achieving operational efficiency using predictive maintenance practices.

Predictive maintenance uses sensor information and analysis methods to measure and predict degradation and future component capability. The idea behind predictive maintenance is that failure patterns are predictable: if component failure can be predicted accurately and the component is replaced before it fails, the costs of operation and maintenance will be much lower.

The sensors fitted across the different machines involved in energy generation collect data on various environmental factors (temperature, humidity, wind speed, etc.) and additional features related to various parts of the wind turbine (gearbox, tower, blades, brake, etc.).

Objective

“ReneWind” is a company working on improving the machinery and processes involved in wind energy production using machine learning, and has collected sensor data on generator failures of wind turbines. They have shared a ciphered version of the data, as the data collected through sensors is confidential (the type of data collected varies by company). The data has 40 predictors, with 20000 observations in the training set and 5000 in the test set.

The objective is to build various classification models, tune them, and find the best one that will help identify failures, so that the generators can be repaired before failing or breaking to reduce the overall maintenance cost.

It is given that the cost of repairing a generator is much less than the cost of replacing it, and the cost of inspection is less than the cost of repair.

“1” in the target variable should be considered a “failure” and “0” represents “no failure”.
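The cost hierarchy above (inspection < repair < replacement) can be made concrete with a small sketch. The cost figures below are invented for illustration only; the actual costs are not given in the data.

```python
# Hypothetical cost figures -- assumptions for illustration, not from the data.
COST_REPLACE = 40_000   # generator fails undetected (false negative)
COST_REPAIR = 15_000    # failure caught in time (true positive)
COST_INSPECT = 5_000    # false alarm inspected (false positive)

def maintenance_cost(tp, fp, fn):
    """Total cost implied by a confusion matrix (true negatives cost nothing)."""
    return tp * COST_REPAIR + fp * COST_INSPECT + fn * COST_REPLACE

# A model that misses failures (high FN) is far more expensive than one that
# raises some false alarms (high FP), even when the alarm-heavy model catches
# fewer "easy" cases.
print(maintenance_cost(tp=50, fp=20, fn=5))   # 1050000
print(maintenance_cost(tp=35, fp=5, fn=20))   # 1350000
```

This is why the modeling below prioritizes catching failures over avoiding false alarms.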

Data Description

Importing libraries

Loading Data

There appear to be some missing values in the test dataset; these will need to be imputed when the missing values in the train and validation sets are imputed as well.
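A loading-and-inspection step of this shape is assumed (the file names are placeholders; a tiny synthetic frame stands in here so the missing-value check is demonstrable):

```python
import numpy as np
import pandas as pd

# In the notebook, something like the following loads the shared data
# (file names assumed):
# train = pd.read_csv("Train.csv")
# test = pd.read_csv("Test.csv")

# Tiny synthetic stand-in with the same kind of columns.
train = pd.DataFrame({
    "V1": [0.5, np.nan, -1.2, 0.3],
    "V2": [1.1, 0.4, np.nan, -0.7],
    "Target": [0, 1, 0, 0],
})

# Per-column NaN counts: in the real training set, V1 and V2 each show 18.
print(train.isnull().sum())
print(train.shape)
```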

There are 40 unlabeled predictor columns, so the modeling is blind. All but two columns have the full 20000 non-null values. All predictor columns are float64, so no type conversions are needed. The Target column is int64, indicating whether the machinery/process has failed (1) or is still functioning properly (0).

V1 and V2 both have 18 missing values.

There are 40 variable columns plus one Target column. The top rows of the data look fine.

EDA

The target variable is clearly imbalanced, something to account for when processing the data later to help improve model performance.
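The imbalance check is typically a one-liner. The class ratio below is a synthetic stand-in, not the real notebook output:

```python
import pandas as pd

# Synthetic stand-in for the Target column; the true failure share comes from
# the notebook output, this split is only illustrative.
target = pd.Series([0] * 945 + [1] * 55)

# Normalized value counts show the class imbalance at a glance.
print(target.value_counts(normalize=True))
```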

All the variables range from negative to positive values, with varying standard deviations. The values fall roughly between -20 and 20 for all variables.

Plotting histograms and boxplots for all the variables

Plotting all the features at one go
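A plotting loop of this shape produces a histogram-plus-boxplot pair per variable. The data here is synthetic and only a few columns are drawn, to keep the sketch self-contained:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt

# A few synthetic stand-ins for the 40 sensor columns.
rng = np.random.default_rng(1)
columns = {f"V{i}": rng.normal(size=500) for i in range(1, 5)}

figures = 0
for name, values in columns.items():
    # Histogram on top, boxplot underneath, sharing the x-axis.
    fig, (ax_hist, ax_box) = plt.subplots(
        2, 1, sharex=True, gridspec_kw={"height_ratios": [3, 1]}
    )
    ax_hist.hist(values, bins=30)
    ax_box.boxplot(values, vert=False)
    ax_hist.set_title(name)
    plt.close(fig)  # use plt.show() interactively
    figures += 1

print(figures)  # one figure per variable
```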

All the variables appear roughly normally distributed, with some slightly skewed left or right. None of the outliers are far enough removed from the rest of the data to warrant removal, especially since no information beyond the variable name is available, so all data points will be kept.

Some variable pairs show a general linear trend, but most do not appear to be highly linearly correlated.

There are some high positive and negative correlations (ranging from -0.80 to 0.84). However, with no information about the variables, none of the correlations are close enough to 1 to warrant dropping any columns before modeling.

Data Pre-processing

No variables need to be dropped at this time, so we will split the data before handling missing values, to avoid leaking validation information into the imputation.
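The split-first-then-impute order can be sketched as follows on synthetic data (the median strategy is an assumption):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 200 rows, 4 features, a few NaNs injected like V1/V2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[rng.integers(0, 200, 10), 0] = np.nan
y = rng.integers(0, 2, 200)

# Split FIRST, so the imputer's statistics come only from the training fold.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

imputer = SimpleImputer(strategy="median")  # median strategy is an assumption
X_train = imputer.fit_transform(X_train)
X_val = imputer.transform(X_val)            # reuse the training medians

print(np.isnan(X_train).sum(), np.isnan(X_val).sum())  # 0 0
```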

Model Building

Model evaluation criterion

The nature of predictions made by the classification model will translate as follows:

True positives (TP) are failures correctly predicted by the model; these generators are repaired before breaking, incurring the repair cost.
False negatives (FN) are real failures the model misses; these generators break down and must be replaced, incurring the replacement cost, the most expensive outcome.
False positives (FP) are false alarms; these generators are inspected unnecessarily, incurring the inspection cost.

Which metric to optimize?

Recall needs to be maximized: the higher the recall, the fewer false negatives, and a false negative (a missed failure) carries the highest cost, a full generator replacement.

Let's define a function to output different metrics (including recall) on the train and test sets, and a function to show the confusion matrix, so that we do not have to repeat the same code while evaluating models.
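Such helper functions might look like the sketch below. The function names and the one-row-DataFrame format are illustrative choices, not necessarily the notebook's:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.tree import DecisionTreeClassifier

def model_performance_classification(model, predictors, target):
    """Metrics (recall included) for a fitted classifier, as a one-row frame."""
    pred = model.predict(predictors)
    return pd.DataFrame({
        "Accuracy": [accuracy_score(target, pred)],
        "Recall": [recall_score(target, pred)],
        "Precision": [precision_score(target, pred)],
        "F1": [f1_score(target, pred)],
    })

def show_confusion_matrix(model, predictors, target):
    """Confusion matrix as raw counts; a heatmap can be layered on top."""
    return confusion_matrix(target, model.predict(predictors))

# Quick demo on a trivially separable toy problem.
X = np.array([[0.0], [1.0], [0.0], [1.0]])
y = np.array([0, 1, 0, 1])
tree = DecisionTreeClassifier(random_state=1).fit(X, y)
print(model_performance_classification(tree, X, y))
print(show_confusion_matrix(tree, X, y))
```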

Defining scorer to be used for cross-validation and hyperparameter tuning
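A recall scorer for cross-validation and tuning can be built with `make_scorer`; the toy dataset below only demonstrates the wiring:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Recall is the metric to maximize: a missed failure (false negative) is the
# costliest outcome, so cross-validation and tuning both score on recall.
scorer = make_scorer(recall_score)

# Demonstration on an imbalanced synthetic dataset (not the ReneWind data).
X, y = make_classification(n_samples=300, weights=[0.9], random_state=1)
scores = cross_val_score(DecisionTreeClassifier(random_state=1), X, y,
                         scoring=scorer, cv=5)
print(scores.mean())
```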

Model Building with original data

Sample Decision Tree, Bagging, Random Forest, GBM, AdaBoost, XGBoost, and Logistic Regression models building with original data
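A comparison loop of this shape fits each candidate model and records its cross-validated recall. The data here is synthetic, and XGBoost is skipped gracefully if it is not installed:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=1),
    "Bagging": BaggingClassifier(random_state=1),
    "Random Forest": RandomForestClassifier(random_state=1),
    "GBM": GradientBoostingClassifier(random_state=1),
    "AdaBoost": AdaBoostClassifier(random_state=1),
    "Logistic Regression": LogisticRegression(random_state=1),
}
try:  # XGBoost is a separate install; skip it if absent
    from xgboost import XGBClassifier
    models["XGBoost"] = XGBClassifier(random_state=1, eval_metric="logloss")
except ImportError:
    pass

# Toy imbalanced data standing in for the ReneWind training set.
X, y = make_classification(n_samples=400, weights=[0.9], random_state=1)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

results = {name: cross_val_score(model, X, y, scoring="recall", cv=cv).mean()
           for name, model in models.items()}
for name, score in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```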

XGBoost gives the highest cross-validation recall, followed by Decision Tree. Those two models will be chosen for tuning, to see whether optimizing hyperparameters yields further improvement.

Model Building with Oversampled data
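Oversampling grows the minority (failure) class until the classes are balanced. The sketch below uses plain random oversampling with scikit-learn only; the notebook may instead use SMOTE from imbalanced-learn, which synthesizes new minority points rather than duplicating existing ones:

```python
import numpy as np
from sklearn.utils import resample

# Synthetic imbalanced data standing in for the training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.array([0] * 945 + [1] * 55)

X_majority, X_minority = X[y == 0], X[y == 1]

# Resample the minority class WITH replacement up to the majority size.
X_minority_up = resample(X_minority, replace=True,
                         n_samples=len(X_majority), random_state=1)

X_over = np.vstack([X_majority, X_minority_up])
y_over = np.array([0] * len(X_majority) + [1] * len(X_minority_up))
print(np.bincount(y_over))  # both classes now at 945
```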

This model comparison on oversampled data also shows XGBoost with the highest performance, with Random Forest coming in second. Decision Tree still performs well overall.

Model Building with Undersampled data
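Undersampling is the mirror image: the majority class is shrunk to the minority size, which discards most of the majority data and likely explains the performance drop noted below. A sketch with scikit-learn's `resample` (the notebook may use imbalanced-learn's `RandomUnderSampler` instead):

```python
import numpy as np
from sklearn.utils import resample

# Synthetic imbalanced data standing in for the training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.array([0] * 945 + [1] * 55)

X_majority, X_minority = X[y == 0], X[y == 1]

# Resample the majority class WITHOUT replacement down to the minority size.
X_majority_down = resample(X_majority, replace=False,
                           n_samples=len(X_minority), random_state=1)

X_under = np.vstack([X_majority_down, X_minority])
y_under = np.array([0] * len(X_majority_down) + [1] * len(X_minority))
print(np.bincount(y_under))  # both classes now at 55
```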

XGBoost also performs well on the undersampled data. However, undersampling seems to lower model performance overall. GBM and AdaBoost also perform well with undersampled data, while Decision Tree performs worst of all the models.

Hyperparameter Tuning

Sample Parameter Grids

Hyperparameter tuning can take a long time to run, so to keep runtimes manageable you can use the following grids wherever required.

# Gradient Boosting (GBM)
param_grid = {
    "n_estimators": np.arange(100, 150, 25),
    "learning_rate": [0.2, 0.05, 1],
    "subsample": [0.5, 0.7],
    "max_features": [0.5, 0.7],
}

# AdaBoost
param_grid = {
    "n_estimators": [100, 150, 200],
    "learning_rate": [0.2, 0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}

# Bagging
param_grid = {
    "max_samples": [0.8, 0.9, 1],
    "max_features": [0.7, 0.8, 0.9],
    "n_estimators": [30, 50, 70],
}

# Random Forest
param_grid = {
    "n_estimators": [200, 250, 300],
    "min_samples_leaf": np.arange(1, 4),
    "max_features": list(np.arange(0.3, 0.6, 0.1)) + ["sqrt"],
    "max_samples": np.arange(0.4, 0.7, 0.1),
}

# Decision Tree
param_grid = {
    "max_depth": np.arange(2, 6),
    "min_samples_leaf": [1, 4, 7],
    "max_leaf_nodes": [10, 15],
    "min_impurity_decrease": [0.0001, 0.001],
}

# Logistic Regression
param_grid = {"C": np.arange(0.1, 1.1, 0.1)}

# XGBoost
param_grid = {
    "n_estimators": [150, 200, 250],
    "scale_pos_weight": [5, 10],
    "learning_rate": [0.1, 0.2],
    "gamma": [0, 3, 5],
    "subsample": [0.8, 0.9],
}

Sample tuning method (RandomizedSearchCV) for Decision tree and XGBoost with original data
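A RandomizedSearchCV call of this shape samples a fixed number of parameter combinations instead of exhausting the grid. The sketch reuses the Decision Tree grid from above on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the training set.
X, y = make_classification(n_samples=400, weights=[0.9], random_state=1)

# The Decision Tree grid from above.
param_grid = {
    "max_depth": np.arange(2, 6),
    "min_samples_leaf": [1, 4, 7],
    "max_leaf_nodes": [10, 15],
    "min_impurity_decrease": [0.0001, 0.001],
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_distributions=param_grid,
    n_iter=10,            # sample 10 combinations instead of the full grid
    scoring="recall",     # optimize the metric chosen above
    cv=5,
    random_state=1,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```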

Best parameters for the Decision Tree and XGBoost on the original dataset will be used later.

Sample tuning method for Decision tree and XGBoost with oversampled data

Best parameters for the Decision Tree and XGBoost on the oversampled dataset will be used later.

Sample tuning method for Decision tree and XGBoost with undersampled data

Best parameters for the Decision Tree and XGBoost on the undersampled dataset will be used later.

Model performance comparison and choosing the final model

Original-data Decision Tree, tuned parameters: {'min_samples_leaf': 7, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 15, 'max_depth': 5}

Original-data XGBoost, tuned parameters: {'subsample': 0.8, 'scale_pos_weight': 10, 'n_estimators': 150, 'learning_rate': 0.1, 'gamma': 3}, CV score = 0.8379

Oversampled-data Decision Tree, tuned parameters: {'min_samples_leaf': 7, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 10, 'max_depth': 2}, CV score = 0.9124

Oversampled-data XGBoost, tuned parameters: {'subsample': 0.9, 'scale_pos_weight': 10, 'n_estimators': 200, 'learning_rate': 0.1, 'gamma': 5}, CV score = 0.9962

Undersampled-data Decision Tree, tuned parameters: {'min_samples_leaf': 1, 'min_impurity_decrease': 0.001, 'max_leaf_nodes': 15, 'max_depth': 11}, CV score = 0.8363

Undersampled-data XGBoost, tuned parameters: {'subsample': 0.9, 'scale_pos_weight': 10, 'n_estimators': 150, 'learning_rate': 0.2, 'gamma': 3}, CV score = 0.9174

Over- and undersampling the data, combined with RandomizedSearchCV tuning, significantly improved the recall of both models on the validation sets. RandomizedSearchCV was chosen over a full grid search because of the large dataset, to keep tuning efficient.

Test set final performance

The XGBoost model with the parameters below was chosen as the final model due to its excellent recall on the training and validation datasets.

Recall on the validation set came out to be 1, which is a great result for the model.

The most important feature is V36, followed by V24, V40, V6, etc.

Pipelines to build the final model
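A final pipeline chains imputation and the classifier into one object, so the exact same preprocessing is applied at fit and predict time. The notebook's final model is XGBoost; a Decision Tree stands in below to keep the sketch dependency-free, and the median strategy is an assumption:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Impute-then-classify pipeline; swap in the tuned XGBClassifier for the
# "model" step in the real notebook.
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),     # strategy is an assumption
    ("model", DecisionTreeClassifier(random_state=1)),
])

# Synthetic data with a few NaNs, as in V1/V2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[rng.integers(0, 200, 8), 0] = np.nan
y = rng.integers(0, 2, 200)

pipe.fit(X, y)        # fit imputer and model in one call
pred = pipe.predict(X)
print(len(pred))
```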

The pipeline works well on the undersampled data, yielding results quickly. Recall on the test dataset is still 84.7%, which is good.

Business Insights and Conclusions

Features in order of importance:
V36
V24
V40
V6
V14
V12
V15
V3
V16